ABSTRACT

Spam, also known as Unsolicited Commercial Email (UCE), 
is the bane of email communication. Many data mining 
researchers have addressed the problem of detecting spam, 
generally by treating it as a static text classification prob- 
lem. True in vivo spam filtering has characteristics that 
make it a rich and challenging domain for data mining. In- 
deed, real-world datasets with these characteristics are typ- 
ically difficult to acquire and to share. This paper demon- 
strates some of these characteristics and argues that re- 
searchers should pursue in vivo spam filtering as an accessi- 
ble domain for investigating them. 

General Terms 

spam, text classification, challenge problems, class skew, im- 
balanced data, cost-sensitive learning, data streams, concept 
drift 

1. INTRODUCTION 

Spam, also known as Unsolicited Commercial Email (UCE) 
and Unsolicited Bulk Email (UBE), is commonplace every- 
where in email communication 1 . Spam is a costly problem 
and many experts agree it is only getting worse |71 1241 EHI 1341 
I14| . Because of the economics of spam and the difficulties 
inherent in stopping it, it is unlikely to go away soon. 
Many data mining and machine learning researchers have 
worked on spam detection and filtering, commonly treat- 
ing it as a basic text classification problem. The problem 
is popular enough that it has been the subject of a Data 
Mining Cup contest [10] as well as numerous class projects.
Bayesian analysis has been very popular )28ll3UllTSl l3l. but 
researchers have also used SVMs [20], decision trees [I],
memory and case-based reasoning [29, 15], rule learning [27]
and even genetic programming [19].

But researchers who treat spam filtering as an isolated text
classification task have only addressed a portion of the
problem. This paper argues that real-world in vivo spam
filtering is a rich and challenging problem for data mining.
By "in vivo" we mean the problem as it is truly faced in an
operating environment, that is, by an on-line filter on a mail
account that receives realistic feeds of email over time, and
serves a human user. In this context, spam filtering faces
issues of skewed and changing class distributions; unequal
and uncertain error costs; complex text patterns; a complex,
disjunctive and drifting target concept; and challenges of
intelligent, adaptive adversaries. Many real-world domains
share these characteristics and would benefit indirectly from
work on spam filtering.

1 The term "spam" is sometimes used loosely to mean any
message broadcast to multiple recipients (regardless of intent)
or any message that is undesired. Here we intend the
narrower, stricter definition: unsolicited commercial email sent
to an account by a person unacquainted with the recipient.

Improving spam filtering is a worthy goal in itself, but this 
paper takes the (admittedly selfish) position that data min- 
ing researchers should study the problem for the benefit 
of data mining. It is unclear whether spam filtering ef- 
forts could genuinely benefit from data mining research. On 
the other hand, one of the persistent difficulties of research 
in many real-world domains is that of acquiring and shar- 
ing datasets. Most companies, for example, do not release 
customer transaction data; we are aware of no public do- 
main datasets containing genuine fraudulent transactions for 
studying fraud detection. Even sharing such data between 
partner companies usually requires formal non-disclosure 
agreements. In other domains datasets may still have copy- 
right or privacy issues. Few datasets involving concept drift 
or changing class distributions are publicly available. With- 
out such datasets, the ability to replicate results and to 
compare algorithm performance is hindered and progress on 
these research topics will be impaired. Spam data are easily 
accessible and shareable, which makes spam filtering a good 
domain testbed for investigating many of the same issues. 
The remainder of the paper enumerates these research issues 
and describes how they are manifested in in vivo spam filter- 
ing. The final section of the paper discusses how researchers 
could begin exploring the domain. 

2. CHALLENGES 

2.1 Skewed and drifting class distributions 

Like most text classification domains, spam presents the
problem of a skewed class distribution, i.e., the proportion of
spam to legitimate email is uneven. There are no generally
agreed upon class priors for this problem. Gomez Hidalgo
[15] points out that the proportion of spam messages
reported in research datasets varies considerably, from 16.6%
to 88.2%. This may be simply because the proportion varies
considerably from one individual to another. The amount
of spam received depends on the email address, the degree
of exposure, the amount of time the address has been public
and the upstream filtering. The amount of legitimate
email received similarly varies greatly from one individual
to another.

Figure 1: Weekly variation in message traffic, spam versus legitimate email

Figure 2: SpamCop: Spam forwarded and reports sent

Perhaps more importantly, spam varies over time as well. 
This was demonstrated dramatically in 2002 when a large 
number of open relays and open proxies were brought on-line 
in Asian countries, primarily Korea and China. Such a large 
new pool of unprotected machines provided great opportu- 
nities for spammers, and soon email servers throughout the 
world experienced a huge surge in the amount of spam they 
forwarded and received. The problem became so bad that 
for a brief time all email from certain Asian countries was 
blocked completely by some ISPs.

In spite of claims that spam is generally increasing [7| 1241 
13, the volume varies considerably and non-monotonically 
on a daily or weekly scale. Calculating spam proportion 
even approximately is difficult. Although some public spam
datasets are available (see Appendix A), we are aware of no
personal email datasets arranged over time, so it is difficult
to match the two to establish priors. Nevertheless, using
several datasets we can make a case that spam priors change
significantly over time.

Figure 3: Drifting priors: weekly estimates of p(spam) taken
from data in figure 1.

Figure 1 shows a graph of spam volume received in 2002
by Paul Wouters of Xtended Internet 2 . In 2002 the spam
volume was 146 ± 55 messages per week, indicating a great
deal of variation in spite of its upward trend. For most
people, the volume of legitimate email received varies as
well. Figure 1 also shows a graph of the number of legitimate
messages saved by the author over the weeks in 2002. The
volume is 12.3 ± 6.4 messages per week.

Figure 2 shows the volume of reports issued from
SpamCop's website 3 . This graph also demonstrates some of
spam's episodic nature. SpamCop is a service used by many
people to filter spam and to submit reports (complaints) to the
originators of spam. Both the amount of spam submitted
and the number of reports sent show clear episodic behavior.

These graphs show time variation in both the volume of
spam and the volume of legitimate email received, something
that researchers have not generally acknowledged. Since the
two sources of email, senders of spam and senders of
legitimate email, are independent parties with little in common,
we can expect their variation to be statistically uncorrelated,
and the class priors will vary over time. No fixed prior will
be correct.

Figure 4: Email volume from Eide's trial, (a) absolute volume, (b) resulting prior p(spam).

How much could we expect class priors to vary? If we assume
that a user received the spam shown in figure 1 and the
legitimate email also shown in figure 1, we can estimate the
class prior p(spam) simply as the proportion of weekly messages
that are spam. Figure 3 shows a graph of this value, which
ranges from about .67 to .99.
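
The estimate described above amounts to a per-week ratio. The following is a minimal sketch of that calculation; the function name and the weekly counts are hypothetical placeholders, not the data plotted in the figures.

```python
def weekly_spam_prior(spam_counts, legit_counts):
    """Estimate p(spam) per week as spam / (spam + legitimate)."""
    priors = []
    for spam, legit in zip(spam_counts, legit_counts):
        total = spam + legit
        # A week with no mail at all gives no evidence, so record None.
        priors.append(spam / total if total else None)
    return priors


# Hypothetical weekly message counts (illustrative only).
spam_per_week = [120, 95, 210, 160]
legit_per_week = [14, 20, 6, 12]
print(weekly_spam_prior(spam_per_week, legit_per_week))
```

Because the two counts are taken from different mailboxes in the text, any prior computed this way is an illustration of variability rather than a measurement for a single user.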

A further demonstration of changing priors appears in
Kristian Eide's study of Bayesian spam filters [12]. In evaluating
these filters he measured the volume of spam and legitimate
email he received over the course of one month. These
volumes are graphed in figure 4a, and the computed daily spam
prior is graphed in figure 4b. The prior ranges from .32 to
.9, showing greater variation than in figure 3 though the
skew is not as high.

Variation in class priors may be problematic for researchers 
because it makes solution superiority more difficult to es- 
tablish. A classifier that performs better than another on 
a dataset with 80% spam may perform worse on one with 
40% spam H3. 

Should researchers be concerned about these varying class
priors? This question is difficult to answer conclusively
because it depends on classifier performance as well as error
cost assumptions (discussed in section 2.2). But by employing
the cost curve framework of Drummond and Holte [11]
we can answer a related question: How much of cost space is
influenced by this variation? This question can be answered
by calculating the span of the Probability Cost Function
(PCF), which is the x axis of a cost curve. The PCF ranges
from zero to one and is a function of the class prevalence
and the ratio of misclassification error costs. In the case of
spam:

\[ PCF_{spam} = \frac{p(spam) \cdot cost(FN)}{p(spam) \cdot cost(FN) + p(legit) \cdot cost(FP)} \]

If we assume that the cost of a false positive (that is, of
classifying a legitimate message as spam) is about ten times
that of a false negative, this reduces to

\[ PCF_{spam} = \frac{p(spam)}{p(spam) + 10 \cdot p(legit)} \]

Using the p(spam) range from figure 4b, the PCF range of
interest for spam filtering is .04 < PCF_spam < .47. Since the
entire PCF range is [0,1], this means nearly half of cost space
is influenced by this variation in priors. Any classifier whose
performance lies within this 44% could be a competitive
solution. This is a wide range, and it is reasonable to expect
classifier superiority to vary within it.
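
The PCF bounds quoted above can be checked numerically. This sketch assumes the standard Drummond and Holte definition of the PCF, the daily prior range of .32 to .9 reported for Eide's trial, and the ten-to-one cost ratio stated in the text; the function and parameter names are illustrative.

```python
def pcf_spam(p_spam, cost_fp, cost_fn):
    """Probability Cost Function for the spam (positive) class.

    PCF = p(spam)*cost(FN) / (p(spam)*cost(FN) + p(legit)*cost(FP)),
    with p(legit) = 1 - p(spam).
    """
    numerator = p_spam * cost_fn
    return numerator / (numerator + (1.0 - p_spam) * cost_fp)


# Assumed inputs: prior range .32 to .9, false positive ten times as costly.
low = pcf_spam(0.32, cost_fp=10.0, cost_fn=1.0)
high = pcf_spam(0.90, cost_fp=10.0, cost_fn=1.0)
print(f"{low:.3f} {high:.3f}")  # prints 0.045 0.474
```

Under these assumptions the computed values agree with the quoted bounds of .04 and .47 to within rounding.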

The purpose of this analysis is not to call into question the 
validity of prior work, but to point out that changing class 
distributions are a reality in this domain and their influence 
on solutions should be tested. Conversely, researchers in- 
vestigating skewed and varying class distributions would do 
well to study the spam filtering problem. 
Exactly how a researcher should best track and adjust class 
priors is an open question and will require research. Time 
series work in statistics should provide some strategies, for 
example, using an exponentially decayed average of recent 
priors. However researchers estimate priors, they should ac- 
knowledge that priors vary and static values are unrealistic. 
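
As one illustration of the exponentially decayed average mentioned above, the sketch below updates a running estimate of p(spam) from a sequence of observed per-period proportions. The smoothing weight `alpha` and the starting value `initial` are assumptions chosen for the example, not recommended settings.

```python
def decayed_prior(observations, alpha=0.3, initial=0.5):
    """Exponentially decayed average of observed spam proportions.

    alpha is the weight given to the newest observation (assumed value);
    older observations decay geometrically by a factor of (1 - alpha).
    """
    estimate = initial
    history = []
    for observed in observations:
        estimate = alpha * observed + (1.0 - alpha) * estimate
        history.append(estimate)
    return history


# Hypothetical weekly proportions of spam.
print(decayed_prior([0.70, 0.95, 0.90, 0.67]))
```

A larger `alpha` tracks sudden surges more quickly but is noisier; which setting suits a given mail stream is exactly the kind of question left open here.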

2.2 Unequal and uncertain error costs 

A further complication of in vivo filtering is the asymmetry 
of error costs. Viewing the filter as a spam classifier, a spam 
message is a positive instance and a legitimate message is 
a negative instance. Judging a legitimate email to be spam 
(a false positive error) is usually far worse than judging a 
spam email to be legitimate (a false negative error) . A false 
negative simply causes slight irritation, i.e., the user sees 
an undesirable message. A false positive can be critical. 
If spam is deleted permanently from a mail server, a false 
positive can be very expensive since it means a (possibly 
important) message has been discarded without a trace. If 
spam is moved to a low-priority mail folder for later human 
scanning, or if the address is only used to receive low-priority
email, false positives may be much more tolerable.

In an essay on developing a Bayesian spam filter, Paul Graham
[16] describes the different errors in an insightful comment:

False positives seem to me a different kind of er- 
ror from false negatives. Filtering rate is a mea- 
sure of performance. False positives I consider 
more like bugs. I approach improving the fil- 
tering rate as optimization, and decreasing false 
positives as debugging. 

Ken Schneider, CTO of the mail filtering company BrightMail,
makes the same point more starkly [31]. He argues
that filtering even a small amount of legitimate email de- 
feats the purpose of filtering because it forces the user to 
start reviewing the spam folder for missed messages. Even a 
single missed important message may cause a user to recon- 
sider the value of spam filtering. This argues for assigning 
a very high cost to false positive errors. 
Regardless of the exact values, these asymmetric error costs
must be acknowledged and taken into account by any
acceptable filtering solution. Judging a spam filtering system
by accuracy (or, equivalently, error rate) is unrealistic and
misleading [26]. Some researchers have measured precision
and recall without questioning whether metrics for
information retrieval are appropriate for a filtering task.
Fortunately, most researchers have acknowledged these
asymmetric costs, but methods for dealing with them have been
ad hoc. The 2003 Data Mining Cup Competition [10]
required that learned classifiers have no more than a 1% false
positive rate, but the organizers gave no justification for
this cut-off. Graham [16] simply double-counted the tokens
of his legitimate email, essentially considering the cost of a
false positive to be twice that of a false negative. Sahami
et al. [28] used a very high probability threshold of .999 for
classifying a message as spam. Androutsopoulos et al. [5]
performed more careful experiments across cost ratios of 1,
10 and 100, exploring two orders of magnitude of cost ratios.
These approaches suggest a deeper issue: true costs of fil- 
tering errors may simply be unknown to the data mining 
researcher, or may be known only approximately. Only the 
end user will know the consequences of filtering mistakes 
and be able to estimate error tradeoffs. In vivo filtering 
requires flexibility of solutions: the user should be able to 
specify the approximate costs (or relative severity) of the 
errors and the run-time filter should accommodate. Admit- 
tedly this requirement complicates research evaluation since 
the superiority of an approach may not extend throughout 
a cost range, and multiple experiments may have to be per- 
formed. 

Such uncertainty is actually common in real-world domains,
where experts may have difficulty stating the exact cost of
an erroneous action, or the cost of the action may vary
depending on external circumstances. This situation
motivated development of a framework based on ROC analysis
for evaluating and managing classifiers when error costs are
uncertain [25]. In the case of spam filtering, error costs
may not change over time but they do vary
between users. Gomez Hidalgo [15] used this framework
for developing and evaluating spam filtering solutions, and
found it useful. Drummond and Holte [11] have also
developed a cost curve framework that extends ROC analysis and
serves much the same purpose. Whatever technique is used
for evaluating classifier performance, researchers should be
prepared to demonstrate a solution's performance over a
range of costs.
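
To make "performance over a range of costs" concrete, the following sketch applies the standard expected-cost decision rule: label a message spam when its estimated spam probability exceeds cost(FP) / (cost(FP) + cost(FN)). The scored messages and function names are hypothetical, and the sketch assumes the classifier's probabilities are reasonably calibrated.

```python
def spam_threshold(cost_fp, cost_fn):
    """Probability above which labelling a message spam has lower expected cost."""
    return cost_fp / (cost_fp + cost_fn)


def expected_cost(scored, cost_fp, cost_fn):
    """Average cost of thresholded decisions on (p_spam, is_spam) pairs."""
    threshold = spam_threshold(cost_fp, cost_fn)
    total = 0.0
    for p_spam, is_spam in scored:
        predicted_spam = p_spam > threshold
        if predicted_spam and not is_spam:
            total += cost_fp      # legitimate message filtered
        elif not predicted_spam and is_spam:
            total += cost_fn      # spam reaches the inbox
    return total / len(scored)


# Hypothetical classifier outputs: (estimated p(spam), true label).
scored = [(0.95, True), (0.60, False), (0.40, True), (0.05, False)]
for ratio in (1, 10, 100):
    print(ratio, spam_threshold(ratio, 1), expected_cost(scored, ratio, 1))
```

Under this rule a cost ratio of 999 to 1 yields a threshold of .999, and the loop shows how the same classifier incurs different average costs as the ratio moves across two orders of magnitude.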

2.3 Disjunctive and changing target concept 

Section 2.1 made the case that the amount of spam drifts
over time, so class distributions vary. It is also true that 
the content of spam changes over time, so class-conditioned 
feature probabilities will change as well. 
Some spam topics are perpetual, such as advertisements 
for pornography sites, offers for mortgage re-financing, and 
moneymaking schemes. Other topics are bursty or occur in 
epidemics. 

One notorious example of a spam ploy coming into vogue is 
the "Nigerian Money" scam, a get-rich-quick scam in which 
help was solicited to transfer money from a Nigerian bank 
account [32]. The details varied, but the sender usually
claimed to be responsible for a large bank account and re- 
quested assistance in "liberating" the funds from the Nige- 
rian government. The sender was willing to pay generously 
for access to a foreign bank account into which the money 
would be transferred. This account was usually drained of
funds once access was granted. Eventually the people re- 
sponsible for the scam were arrested, and spam of that type 
declined quickly (unfortunately, variants continue to circu- 
late as other people adopt the general idea). Prior to this 
scam, keywords such as nigeria and assistance were not 
strong predictors of spam. 

A more dramatic episode occurred in April of 2003 when 
decks of playing cards depicting "Iraq's Most Wanted" were 
made available for sale. These cards were advertised pri- 
marily via spam. The advertising campaign created such a 
spam blizzard that its story — and the campaign's success — 
were written up in the New York Times [18]. This campaign
abated quickly and few of the terms uniquely associated with 
this episode retain much predictive power now. 
The point for researchers is that spam content changes so 
the "spam" concept should drift inevitably. Some compo- 
nents (disjuncts) of the concept description should remain 
constant or change only slowly. Others will spike during 
epidemics, as specific scams or merchandising schemes come 
into vogue. Even perpetual topics do not exhibit constant 
term frequencies. 

It is difficult to estimate how much we can expect "spam" 
as a concept to drift over time, in part because no metric 
of concept drift has been adopted by the community. It is 
beyond the scope of this paper to present a rigorous investi- 
gation of concept drift in spam, but a simple technique can 
demonstrate significant word frequency variation. 
Swan and Allan [33] employed a χ² test to discover "bursty"
topics in daily news stories. Their test was designed to
determine whether the appearance of a term on a given day was
statistically significant. This test can be applied to weekly
groups of spam messages in Wouters' archive, using the
average weekly frequency of a term as its expected value. The
results are shown in figure 5 with selected terms listed on
the left side and a column for every week (1-52) of 2002
extending to the right. The height of a bar at a term-week
is proportional to the term's frequency in that week. The
special symbol "B" denotes a term burst: it appears in a
term's row if that term appeared more than four times in
the week and the χ² test succeeded at p < 0.01.
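
The burst rule just described can be sketched as follows. This is a simplified one-degree-of-freedom goodness-of-fit version that compares each week's count with the term's average weekly count; it is an assumption made for illustration and is not necessarily the exact formulation used by Swan and Allan. The constant 6.635 is the chi-square critical value for one degree of freedom at p = 0.01.

```python
CHI2_CRITICAL_P01_DF1 = 6.635  # chi-square critical value, 1 d.o.f., p = 0.01


def burst_weeks(weekly_counts, min_count=4):
    """Return 1-based week numbers in which a term 'bursts'.

    A week is flagged when the term appears more than min_count times,
    exceeds its average weekly frequency, and the simplified chi-square
    statistic (observed - expected)^2 / expected passes the p < 0.01 cut-off.
    """
    expected = sum(weekly_counts) / len(weekly_counts)
    if expected == 0:
        return []
    bursts = []
    for week, observed in enumerate(weekly_counts, start=1):
        chi2 = (observed - expected) ** 2 / expected
        if (observed > min_count and observed > expected
                and chi2 > CHI2_CRITICAL_P01_DF1):
            bursts.append(week)
    return bursts


# Hypothetical weekly counts for a single term.
print(burst_weeks([1, 0, 2, 1, 30, 1, 0, 1]))  # prints [5]
```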
Figure 5 shows that spam has complex time-varying behavior.
Some terms recur intermittently, such as adult, click,
free, hot and removed. Others are episodic, e.g., the terms
common in a "Nigerian scam" burst in week 18 (nigeria,
lagos, assistance, beneficiary) and terms in a "pornstar videos"
burst in weeks 31-32 (Orgy, awesome, pornstars, jenna,
lauren, nicki). The term Christmas bursts late in the year and
presumably reappears every year around the same time.

Spam behavior is not simply a matter of one concept drifting
to another in succession, but instead is a superimposition of
constant, periodic and episodic phenomena. Researchers in
data mining have studied classification under concept drift
but it remains an open problem. Work in Topic Detection
and Tracking is likely to be relevant to spam classification,
though technically it addresses a different problem. No
detailed study of real-world concept drift has yet been
undertaken, and to the best of our knowledge there are no
standard datasets for studying it. A longitudinal spam dataset
would be an excellent testbed for investigating issues in
concept drift and stream classification.

Figure 5: Frequency and burstiness of spam terms.

2.4 Intelligent adaptive adversaries 

The spam stream changes over time as different products 
or scams, marketed by spam, come into vogue. There is a 
separate reason for concept drift: spammers are engaged in 
a perpetual "arms race" with email filters [7, 35].

Over time spammers have become increasingly sophisticated
in their techniques for evading filtering [23]. In its early days,
spam would have predictable subject lines like MAKE MONEY
FAST! and Refinance your mortgage. Messages would be
addressed to Undisclosed_recipients or nobody. As basic
header filtering became common in e-mail clients, these
obvious text markers were simple to filter upon so spam
could be discarded easily. As message body scanning
became common, fragments such as viagra and click here
could be checked for as well.

To circumvent simple filtering, spammers began to employ
content obscuring techniques such as inserting spurious
punctuation, using bogus HTML tags and adding HTML
comments in the middle of words. It is now common to see
fragments such as these:

• v.ia.g.ra

• 100% Molney Back Guaranltee!

• Our pro<br2sd9/>duct is doctor reco<br2sd9/>mmen<br2sd9/>ded and made from 100%
natu<br2sd9/>ral ingre<br2sd9/>dients.

• C<!--7udzl5315spp6-->lic<!--yajiwnlxnbecx2-->k
he<!--ehcOaj2pvwu-->re</a>

• Increase testosterone by 254%

When rendered, these are recognizable to most people but 
they foil simple word and phrase filtering. To counter this, 
some filters remove embedded punctuation and bogus HTML 
tags before scanning, and consider them to be additional ev- 
idence that the message is spam. 
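
A minimal sketch of such a preprocessing step appears below. It strips HTML comments, then all tags, then punctuation embedded between word characters, and returns the number of removals so that the count can serve as an additional feature. The regular expressions and the function name are assumptions for illustration; a real filter would need to separate bogus tags from legitimate markup and would have to avoid mangling ordinary text such as domain names or hyphenated words.

```python
import re

HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
HTML_TAG = re.compile(r"<[^>]*>")
# Punctuation sitting between two word characters, e.g. the dots in "v.ia.g.ra".
INNER_PUNCT = re.compile(r"(?<=\w)[.\-_*|](?=\w)")


def normalize(text):
    """Return (cleaned_text, n_removed) for a fragment of message body."""
    text, n_comments = HTML_COMMENT.subn("", text)
    text, n_tags = HTML_TAG.subn("", text)   # simplification: removes every tag
    text, n_punct = INNER_PUNCT.subn("", text)
    return text, n_comments + n_tags + n_punct


print(normalize("v.ia.g.ra"))
print(normalize("C<!--7udzl5315spp6-->lic<!--yajiwnlxnbecx2-->k here</a>"))
```

The removal count is only weak evidence on its own, since legitimate HTML mail also contains tags; how to weight it is left to the learner.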

Spammers are also aware that filters use Bayesian word
analysis and content hashing, so they often pepper their
messages with common English words and nonsense words to
foil these techniques [23]. Messages are designed so these
words are discarded when the text is rendered, or are
rendered unobtrusively. Graham-Cumming [17] maintains an
extensive catalog of the techniques used by spammers to
confuse filters.

Whatever new filtering capabilities arise, it is just a mat- 
ter of time before spammers find ways to evade them. In 
machine learning terms, spammers have a strong interest in 
making the "spam" and "legitimate" classes indistinguish- 
able. Because the discrimination ability of spam filters im- 
proves continually, the resulting concept drifts. 
Because of this text distortion, in vivo spam filtering di- 
verges significantly from most text classification and infor- 
mation retrieval problems, where authors are not deliber- 
ately trying to obfuscate content and defy indexing. Re- 
searchers should expect that they will have to develop tech- 
niques unknown in these related fields. Much of the effort 
of developing spam filters will probably shift from feature 
combining (i.e., experimenting with different induction al- 
gorithms) to feature generation (i.e., devising automatic fea- 
ture generation methods that can adapt to new distortion 
patterns). 

Such an arms race is not uncommon. Co-evolving abilities 
appear often when access to a desired resource is simulta- 
neously sought and blocked by intelligent, adaptive parties. 
Fraud analysts observe criminals developing increasingly so- 
phisticated techniques in response to security improvements 
[13]. With spam, the desired resource is the attention of
email users, and spam may be seen as a way of illicitly gain- 
ing access to it. Another example of an arms race occurs 
in e-commerce. As pricing schemes have become more so- 
phisticated, consumers have become more adept at gaming 
the systems. Sophisticated "shop bots" have been devel- 
oped, and on-line merchants have had to develop ways to 
keep pricing information from them. Both sides continue to 
improve their techniques. Finally, the well-publicized con- 
flict between the music swapping networks and the Amer- 
ican music industry shows characteristics of an arms race, 



as both sides develop more sophisticated methods in their 
battle over access to copyrighted material |2 II . 
Such co-adaptation of intelligent agents is foreign to most 
data mining researchers: the data are mined and the results 
are deployed, but the data environment is not considered to 
be an active entity that will react in turn. With the inter- 
net, much information is freely and automatically available 
to all parties, and interactivity is the rule. I propose that
the future will bring more scenarios involving feedback and 
co-adaptation. Data miners may have to consider the effects 
of mining on their task environment, and perhaps incorpo- 
rate such concerns into the data mining process. Possible 
strategies include concealing one's deployed techniques from 
adversaries, incorporating deception into techniques, or sim- 
ply speeding up the deployment cycle to adapt more quickly 
to adversaries' moves. Spam filtering could be a useful do- 
main in which to explore such strategies. 

3. MEETING THE CHALLENGE 

This paper has made the case that in vivo spam filtering can 
be a complex data mining problem with challenging
characteristics: 

• Skewed and changing class distributions 

• Unequal and uncertain error costs 

• Complex text patterns requiring sophisticated parsing 

• A disjunctive target concept comprising superimposed 
phenomena with complex temporal characteristics 

• Intelligent, adaptive adversaries 

Researchers wishing to explore these issues would do well to 
study in vivo spam filtering. Controlled laboratory datasets 
exhibiting these characteristics are often difficult to acquire 
and to share. Spam filtering, on the other hand, is an excel- 
lent domain for investigating these problems. 
Researchers wishing to pursue this domain should begin col- 
lecting longitudinal data in a controlled manner. Spam is 
notoriously easy to attract. Several studies have measured 
the extent to which various activities attract spam [22, 5],
and this information may be useful. It is easy to create 
ad hoc email addresses (for example, through Hotmail or 
Yahoo) and to advertise them in a controlled manner to 
attract spam. Such addresses are sometimes called "spam 
traps" and are used by email filtering companies such as 
BrightMail to obtain a continuous clean feed of spam for 
analysis. 

A more difficult problem is that of obtaining shareable cor- 
pora of non-spam email, which often contain personal details 
that people want to keep private. Two general approaches 
have been taken: 

1. Researchers who have contributed personal email have 
sought ways to anonymize it. The contributors of the 
UCI "spambase" dataset achieved this by reducing the 
original messages to word frequencies and perform- 
ing feature selection upon the set. Unfortunately, this 
makes it difficult for other researchers to experiment 
with alternative feature selection or text processing op- 
erations on the data. 



Androutsopoulos et al. [U have developed a basic en- 
coding technique for sharing data without compro- 
mising privacy. Their software and several of their 
datasets are available; see Appendix A. 

2. Androutsopoulos et al. [3] have suggested using mes- 
sages from websites and public mailing lists as prox- 
ies for personal email. Their "Ling-spam" corpus uses 
messages from the moderated Linguist list. Other re- 
searchers have suggested that such messages may not 
be representative of the email most people receive. 
Whether this renders mailing list data ineffective for 
exploring in vivo spam filtering remains to be studied. 
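
One simple way to picture an encoding of the kind mentioned in the first approach is to replace each distinct token with an arbitrary integer, so that word order and frequencies survive but the words themselves do not. The sketch below is an assumed illustration only; it does not describe the actual software of Androutsopoulos et al., and the function name is hypothetical.

```python
def encode_corpus(messages):
    """Replace every distinct token with an integer identifier.

    Returns the encoded messages and the private token-to-id mapping,
    which the data owner would keep rather than publish.
    """
    vocab = {}
    encoded = []
    for message in messages:
        ids = []
        for token in message.lower().split():
            if token not in vocab:
                vocab[token] = len(vocab) + 1  # ids assigned in order of first use
            ids.append(vocab[token])
        encoded.append(ids)
    return encoded, vocab


encoded, vocab = encode_corpus(["buy now", "now buy cheap"])
print(encoded)  # prints [[1, 2], [2, 1, 3]]
```

An encoding like this still leaks information through token frequencies, so it should be read as a starting point for discussion rather than as a privacy guarantee.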

However researchers decide to generate such corpora, they 
should consider making their data publicly available. 
This article has outlined the challenges of in vivo spam filter- 
ing and explained how pursuing such challenges could help 
data mining. It is hoped that this article stimulates interest 
in the problem. The appendix and references should serve 
as useful resources for researchers wishing to pursue it. 
APPENDIX

A. SOURCES OF SPAM DATA 

There are several sources of spam data on the internet, 
though researchers should be aware of their limitations. 

1. Several static databases have been used by the machine
learning community. The UCI database "spambase" 4
has a featurized version of spam and legitimate
email. Androutsopoulos et al. [3] have made available
several of their corpora containing both spam and
personal email. All are available for download from
http://www.iit.demokritos.gr/skel/i-config/. Note
that messages in these databases are not reliably
timestamped so they are not useful for measuring
time-varying aspects of spam.

2. Paul Wouters, of Xtended Internet, has an extensive



Although his messages are carefully timestamped, note 
his explanation about the large drop around mid-2002, 
attributed to deleting a number of old mail accounts. 
For this reason I avoided making inferences about spam 
volume from his dataset. 



4. SpamArchive.org is a "community resource used for 
testing, developing, and benchmarking anti-spam tools. 
The goal of this project is to provide a large repository 
of spam that can be used by researchers and tool
developers." Currently SpamArchive has over 200K spam
messages and receives about 5000 messages per day.

5. Bruce Guenter has a longitudinal database of spam
available at http://www.em.ca/~bruceg/spam/. See
the caveat below about measuring spam volume.

Note that these datasets are archives of spam saved over 
time and were not designed to be controlled research datasets. 
It is important to understand the limitations of measuring 
spam volume from any of them. They are kept by owners 
of entire sites rather than individual accounts so the spam 
may be extracted from several mailboxes. The mailboxes 
may include admin and webmaster, which are believed to 
receive more spam than average. Some of these administra- 
tors even use "spam trap" addresses deliberately to attract 
spam. Finally, note that being active on Usenet or the web 
can often get a user added to spamming lists — as can mak- 
ing a spam archive available on the web. For all of these 
reasons, these spam archives may contain more spam than 
the average email user typically gets. 